Martin Schweinberger
January 1, 2026


This tutorial introduces regular expressions (regex) and demonstrates how to use them when working with language data in R. A regular expression is a special sequence of characters that describes a search pattern. You can think of regular expressions as precision search tools — far more powerful than simple find-and-replace — that let you locate, extract, validate, and transform text based on its structure rather than its exact content.
Regular expressions have wide applications across linguistics and computational humanities: searching corpora for inflected forms, extracting named entities, cleaning OCR output, tokenising text, validating annotation schemes, and building text-processing pipelines. Once mastered, they become one of the most versatile tools in any language researcher’s toolkit.
By the end of this tutorial you will be able to:
1. Explain what a regular expression is and how it differs from a simple string search
2. Construct patterns using literal characters, the wildcard ., anchors, character classes, and POSIX classes
3. Apply quantifiers — including greedy and lazy variants — to specify repetition
4. Use capturing groups, non-capturing groups, and alternation
5. Use shorthand escape sequences (\w, \d, \s) and understand the double-backslash requirement in R
6. Write lookahead and lookbehind assertions for context-sensitive matching
7. Apply the key stringr functions — str_detect(), str_extract(), str_replace(), and others — with regular expressions
8. Use regular expressions for practical corpus tasks: concordance searches, text cleaning, metadata extraction, and frequency analysis
9. Integrate regex with dplyr pipelines for filtering and annotation

Before working through this tutorial, please complete or familiarise yourself with:
- Getting Started with R and RStudio
- Loading, Saving, and Generating Data in R
- String Processing in R
Martin Schweinberger. 2026. Regular Expressions in R. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/regex/regex.html (Version 2026.03.28).
For further study, the following resources are highly recommended:
Install required packages (once only):
Load packages:
We will work with two types of objects throughout: a short example sentence for demonstrating individual patterns, and a longer example text representing realistic corpus data.
# Short example sentence for basic demonstrations
sent <- "The cat sat on the mat."
# A longer example text: an excerpt about linguistics
et <- paste(
"Grammar is the system of a language. People sometimes describe grammar as",
"the rules of a language, but in fact no language has rules. If we use the",
"word rules, we suggest that somebody created the rules first and then spoke",
"the language, like the rules of a game. But languages did not start like",
"that. Languages started when humans started to communicate with each other.",
"Grammars developed naturally. After some time, people described the grammar",
"of their languages. Languages change over time. Grammar changes too.",
"Children learn the grammar of their first language naturally. They do not",
"need to study it. Native speakers know intuitively whether a sentence is",
"grammatically correct or not. Non-native speakers often learn grammar rules",
"formally, through instruction. Prescriptive grammar describes how people",
"should speak, while descriptive grammar describes how people actually speak.",
"Linguists study grammars to understand language structure and acquisition.",
"The field of syntax deals with sentence structure, while morphology examines",
"how words are formed. Phonology studies sound systems in human languages.",
"Pragmatics investigates how context influences the interpretation of meaning.",
"Computational linguistics applies formal grammar to natural language processing.",
"Regular expressions are useful tools for searching and extracting patterns.",
"They can match words like 'cat', 'bat', or 'hat' with a single pattern."
)
# Split into individual tokens (words and punctuation)
tokens <- str_split(et, "\\s+") |> unlist()

What you will learn: The building blocks of regular expressions — how each type of pattern works and what it matches.
Key concept: Regular expressions describe structure, not content. [aeiou]{2,} matches any sequence of two or more vowels, regardless of which vowels or in which word.
The simplest regular expression is a literal character — it matches exactly that character. A sequence of literal characters matches that exact sequence:
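The chunk that produced the four TRUE values below was collapsed in this export; a minimal reconstruction (the tested substrings are assumptions):

```r
library(stringr)

sent <- "The cat sat on the mat."

# A literal pattern matches wherever its exact character sequence occurs
str_detect(sent, "c")    # TRUE: "c" occurs in "cat"
str_detect(sent, "cat")  # TRUE: the sequence "cat" occurs
str_detect(sent, "The")  # TRUE
str_detect(sent, "mat")  # TRUE
```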
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
To match a literal dot (rather than “any character”), escape it with a double backslash:
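A sketch of calls producing the three results below (the FALSE example string is an assumption):

```r
library(stringr)

sent <- "The cat sat on the mat."

str_detect(sent, ".")             # TRUE: an unescaped dot matches any character
str_detect(sent, "\\.")           # TRUE: the sentence contains a literal dot
str_detect("no dot here", "\\.")  # FALSE: no literal dot in this string
```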
[1] TRUE
[1] TRUE
[1] FALSE
In most programming languages, a single backslash \ is the regex escape character. In R strings, \ itself must be escaped, so regex escapes require double backslash \\. For example:
- \\. in R code → \. as a regex → matches a literal dot
- \\b in R code → \b as a regex → matches a word boundary
- \\d in R code → \d as a regex → matches a digit

This double-backslash requirement catches many beginners. Remember: every \ you intend for regex needs to be written as \\ in R.
Anchors match positions in the string, not characters. They constrain where in the string a pattern can match.
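The code chunk for the anchor examples below did not survive the export; a hedged reconstruction (the exact patterns tested are assumptions):

```r
library(stringr)

sent <- "The cat sat on the mat."

str_detect(sent, "^The")       # TRUE: string starts with "The"
str_detect(sent, "^cat")       # FALSE: "cat" is not at the start
str_detect(sent, "mat\\.$")    # TRUE: string ends with "mat."
str_detect(sent, "cat$")       # FALSE: "cat" is not at the end
str_detect(sent, "\\bat\\b")   # FALSE: no standalone word "at"
str_detect(sent, "\\bcat\\b")  # TRUE: "cat" occurs as a whole word
str_detect(sent, "\\Bcat\\B")  # FALSE: "cat" is not inside a longer word
str_detect(sent, "\\Bat\\b")   # TRUE: "at" word-internally, ending at a boundary
```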
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
\b is indispensable for corpus searches. Without it, searching for “the” would match “the” inside “other”, “there”, “ather”, and so on. Always use \\bword\\b when you want whole-word matches.
A character class [...] matches any single character from the set listed inside the brackets:
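The chunks behind the character-class outputs below were collapsed; a sketch that reproduces them (the helper string s is an assumption):

```r
library(stringr)

sent <- "The cat sat on the mat."
s <- "Hello World 123"

str_extract_all(sent, "[csm]at")    # "cat" "sat" "mat"
str_extract_all(sent, "[^aeiou ]")  # negated class: everything but vowels and spaces
str_extract_all(s, "[a-z]")         # lowercase letters only
str_extract_all(s, "[A-Z]")         # "H" "W"
str_extract_all(s, "[0-9]")         # "1" "2" "3"
str_extract_all(s, "[a-zA-Z]")      # all letters, either case
```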
[[1]]
[1] "cat" "sat" "mat"
[[1]]
[1] "T" "h" "c" "t" "s" "t" "n" "t" "h" "m" "t" "."
[[1]]
[1] "e" "l" "l" "o" "o" "r" "l" "d"
[[1]]
[1] "H" "W"
[[1]]
[1] "1" "2" "3"
[[1]]
[1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d"
R supports POSIX character classes — named sets written inside [:..:] inside an outer [...]:
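Calls like the following produce the POSIX-class outputs below (the exact input string is an assumption):

```r
library(stringr)

s <- "Hello,\tWorld! 123."

str_extract_all(s, "[[:alpha:]]")  # letters
str_extract_all(s, "[[:digit:]]")  # digits
str_extract_all(s, "[[:punct:]]")  # punctuation marks
str_extract_all(s, "[[:alnum:]]")  # letters and digits
str_extract_all(s, "[[:space:]]")  # the tab and spaces
```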
[[1]]
[1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d"
[[1]]
[1] "1" "2" "3"
[[1]]
[1] "," "!" "."
[[1]]
[1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d" "1" "2" "3"
[[1]]
[1] "\t" " " " "
The full set of POSIX classes available in R:
Class | Matches |
|---|---|
[:alpha:] | Any letter (a-z, A-Z) |
[:lower:] | Lowercase letters (a-z) |
[:upper:] | Uppercase letters (A-Z) |
[:digit:] | Digits (0-9) |
[:alnum:] | Letters and digits |
[:punct:] | Punctuation: . , ; : ! ? " ' ( ) [ ] { } / \ @ # $ % ^ & * - _ + = ~ ` \| |
[:space:] | All whitespace: space, tab, newline, return, form-feed |
[:blank:] | Space and tab only |
[:graph:] | All visible characters (alnum + punct) |
[:print:] | Printable characters (graph + space) |
Quantifiers specify how many times the preceding element should match. The table below gives a complete overview:
Quantifier | Meaning | R example | Example matches |
|---|---|---|---|
* | 0 or more (greedy) | "b*" | "" "b" "bbb" |
+ | 1 or more (greedy) | "b+" | "b" "bbb" |
? | 0 or 1 — makes element optional (greedy) | "colou?r" | "color" "colour" |
{n} | Exactly n times | "[a-z]{5}" | "hello" "world" |
{n,} | n or more times (greedy) | "[a-z]{3,}" | "cat" "grammar" |
{n,m} | Between n and m times (greedy) | "[a-z]{3,6}" | "cat" "gram" "syntax" |
*? | 0 or more (lazy — as few as possible) | "<.*?>" | first tag only in "<b>bold</b>" |
+? | 1 or more (lazy — as few as possible) | "<.+?>" | first tag only in "<b>bold</b>" |
?? | 0 or 1 (lazy) | "colou??r" | "color" "colour" |
{n,m}? | Between n and m times (lazy) | "[a-z]{3,6}?" | shortest run of 3-6 letters |
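The quantifier outputs below come from chunks lost in this export; a sketch of the first examples (the input string x is an assumption):

```r
library(stringr)

x <- "aabbbcc"

str_extract_all(x, "b*")  # includes empty matches at every non-b position
str_extract_all(x, "b+")  # only the run "bbb"
str_detect(c("color", "colour"), "colou?r")  # TRUE TRUE: the "u" is optional
```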
[[1]]
[1] "" "" "bbb" "" "" "" "" ""
[[1]]
[1] "bbb"
[1] TRUE TRUE
[[1]]
[1] "communicat" "intuitivel" "grammatica" "instructio" "rescriptiv"
[6] "descriptiv" "understand" "acquisitio" "morphology" "investigat"
[11] "influences" "interpreta" "omputation" "linguistic" "processing"
[16] "expression" "extracting"
[ per-token output condensed: entries [[1]] to [[100]] are mostly character(0); the non-empty matches are "sometimes", "describe", "language", "somebody", "languages", "Languages", "communicate", "Grammars", "developed", "described", "Languages", "Children", "language" ]
[ reached getOption("max.print") -- omitted 110 entries ]
[ per-token output condensed: entries [[1]] to [[100]] are mostly character(0); the non-empty matches are "system", "People", "rules", "fact", "word", "that", "rules", "first", "then", "spoke", "like", "rules", "start", "like", "when", "humans", "with", "each", "After", "some", "people", "their", "change", "over", "learn", "their", "first", "They", "need" ]
[ reached getOption("max.print") -- omitted 110 entries ]
By default, quantifiers are greedy — they match as much as possible. Adding ? after a quantifier makes it lazy — it matches as little as possible:
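The greedy-versus-lazy contrast below can be reproduced with calls like these (a sketch; the original chunk was collapsed):

```r
library(stringr)

html <- "<b>bold</b> and <i>italic</i>"

str_extract(html, "<.*>")       # greedy: from the first "<" to the last ">"
str_extract(html, "<.*?>")      # lazy: stops at the first ">", giving "<b>"
str_extract_all(html, "<.*?>")  # all four tags
```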
[1] "<b>bold</b> and <i>italic</i>"
[1] "<b>"
[[1]]
[1] "<b>" "</b>" "<i>" "</i>"
Parentheses () create a capturing group — a sub-pattern whose match can be referenced or extracted separately. The alternation operator | means OR within a group or pattern.
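A sketch of the grouping and alternation calls behind the outputs below (the test vector is an assumption):

```r
library(stringr)

words <- c("colour", "color", "colander")

str_detect(words, "col(ou|o)r")       # TRUE TRUE FALSE
str_extract_all(words, "col(ou|o)r")  # "colour", "color", no match
str_detect("The cat sat.", "(cat|dog)")  # TRUE: alternation matches "cat"
```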
[1] TRUE TRUE FALSE
[[1]]
[1] "colour"
[[2]]
[1] "color"
[[1]]
character(0)
[1] TRUE
Use (?:...) when you need to group for alternation or quantification but do not need to capture the match:
Captured groups can be referred back to in replacement strings using \\1, \\2, etc.:
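A minimal backreference example that yields the first result below:

```r
library(stringr)

# Swap the words on either side of "and" by referring back to the two groups
str_replace("cats and dogs", "(\\w+) and (\\w+)", "\\2 and \\1")
# "dogs and cats"
```

The bolded excerpt shown below can be produced the same way by wrapping long words, e.g. str_replace_all(et, "(\\w{8,})", "**\\1**") (the eight-letter threshold is an assumption).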
[1] "dogs and cats"
[1] "Grammar is the system of a **language**. People **sometimes** **describe** grammar as the rules of a **language**, but i"
R supports shorthand escape sequences for common character classes:
Sequence | Matches | Example (R string) |
|---|---|---|
\\w | Word characters: [[:alnum:]_] | "\\w+" |
\\W | Non-word characters: [^[:alnum:]_] | "\\W+" |
\\d | Digits: [[:digit:]] | "\\d+" |
\\D | Non-digits: [^[:digit:]] | "\\D+" |
\\s | Whitespace: [[:space:]] | "\\s+" |
\\S | Non-whitespace: [^[:space:]] | "\\S+" |
\\b | Word boundary (position) | "\\bcat\\b" |
\\B | Non-word boundary (position) | "\\Bcat\\B" |
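The shorthand-class outputs below can be reproduced with calls like these (the input strings are assumptions based on the results shown):

```r
library(stringr)

str_extract_all("price: 4.99", "\\w+")  # "price" "4" "99"
str_extract_all("Call 07-3365-1234 or 07-3346-5678", "\\d+")  # the digit runs
unlist(str_split("word1 word2\tword3  word4", "\\s+"))        # four words
str_extract_all("I study grammar, not grammars", "\\bgrammar\\b")  # whole word only
```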
[[1]]
[1] "price" "4" "99"
[[1]]
[1] "07" "3365" "1234" "07" "3346" "5678"
[1] "word1" "word2" "word3" "word4"
[[1]]
[1] "grammar"
Lookaround assertions match a position based on what comes before or after it, without including that context in the match. They are essential for extracting values that are preceded or followed by specific markers.
Syntax | Name | Matches |
|---|---|---|
(?=...) | Positive lookahead | Position followed by ... |
(?!...) | Negative lookahead | Position NOT followed by ... |
(?<=...) | Positive lookbehind | Position preceded by ... |
(?<!...) | Negative lookbehind | Position NOT preceded by ... |
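A sketch of the lookaround calls behind the price outputs below (the price vector is an assumption inferred from the results):

```r
library(stringr)

prices <- c("$12.99", "$4.50", "€7.00", "£8.95")

str_extract_all(prices, "\\d+(?=\\.)")           # digits followed by a dot
str_extract_all(prices, "(?<=\\$)\\d+\\.\\d+")   # amounts preceded by "$"
str_extract_all(prices, "(?<=[€£])\\d+\\.\\d+")  # amounts preceded by "€" or "£"
```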
[[1]]
[1] "12"
[[2]]
[1] "4"
[[3]]
[1] "7"
[[4]]
[1] "8"
[[1]]
[1] "12.99"
[[2]]
[1] "4.50"
[[3]]
character(0)
[[4]]
character(0)
[[1]]
character(0)
[[2]]
character(0)
[[3]]
[1] "7.00"
[[4]]
[1] "8.95"
A linguistic example — extract words that come before a comma:
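A self-contained sketch producing the result below (the example sentence is an assumption):

```r
library(stringr)

s <- "Grammar, syntax, and morphology are core subfields."
str_extract_all(s, "\\w+(?=,)")  # "Grammar" "syntax": words directly before a comma
```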
[[1]]
[1] "Grammar" "syntax"
Q1. What does the regex ^[A-Z] match?
Q2. What is the difference between colou?r and colo[u]?r?
Q3. You want to match words of exactly 5 characters that consist only of lowercase letters. Which pattern is correct?
stringr Functions

What you will learn: The stringr functions used most frequently with regular expressions, and when to use each.
Key functions: str_detect(), str_count(), str_extract(), str_extract_all(), str_replace(), str_replace_all(), str_remove(), str_remove_all(), str_split(), str_locate()
The stringr package provides a consistent, user-friendly interface to regular expressions in R. All stringr functions follow the same pattern: the string comes first, the pattern second.
str_detect()

Returns TRUE/FALSE for each string in a vector. Most commonly used for filtering:
[1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[1] "syntax" "morphology" "phonology" "pragmatics"
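The chunks above were collapsed in this export; a self-contained example of the same detect-and-filter idiom (the vector is an assumption):

```r
library(stringr)

fields <- c("syntax", "morphology", "phonology", "pragmatics")

str_detect(fields, "ology")         # FALSE TRUE TRUE FALSE
fields[str_detect(fields, "ics$")]  # keep only terms ending in "ics"
```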
str_count()

Counts non-overlapping occurrences of a pattern within each string:
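The counting chunk was not preserved; a minimal sketch (example inputs are assumptions):

```r
library(stringr)

sent <- "The cat sat on the mat."

str_count(sent, "at")            # 3: once each in "cat", "sat", "mat"
str_count(c("aaa", "abc"), "a")  # 3 1: one count per input string
```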
str_extract() and str_extract_all()

str_extract() returns the first match in each string. str_extract_all() returns all matches as a list:
[1] NA "synt" "rph" NA NA NA "ngr"
[[1]]
[1] "12" "99"
[[2]]
[1] "4" "12"
[[3]]
[1] "2024"
[1] "acquisition" "communicate" "Computational" "described"
[5] "describes" "descriptive" "developed" "expressions"
[9] "extracting" "grammatically" "influences" "instruction"
[13] "interpretation" "intuitively" "investigates" "languages"
[17] "Languages" "linguistics" "Linguists" "morphology"
[21] "naturally" "Phonology" "Pragmatics" "Prescriptive"
[25] "processing" "searching" "sometimes" "structure"
[29] "understand"
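A self-contained illustration of the first-match versus all-matches contrast (the dates vector is an assumption, not the chunk that produced the output above):

```r
library(stringr)

dates <- c("born 1988", "no date given", "from 1999 to 2021")

str_extract(dates, "\\d{4}")      # "1988" NA "1999": first match per string
str_extract_all(dates, "\\d{4}")  # all matches per string, as a list
```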
str_replace() and str_replace_all()

Replace the first (or all) occurrence(s) of a pattern with a replacement string. Backreferences (\\1, \\2) refer to captured groups in the replacement:
[1] "The dog sat on the mat."
[1] "The dog dog on the dog."
[1] "dogs and cats"
[1] "Grammar is the system of a **language**. People **sometimes** **describe** grammar as the rules of a **language**, but i"
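A sketch reproducing the first three replacement results above (the collapsed chunk presumably used similar calls):

```r
library(stringr)

sent <- "The cat sat on the mat."

str_replace(sent, "cat", "dog")          # only the first match is replaced
str_replace_all(sent, "[csm]at", "dog")  # "The dog dog on the dog."
str_replace("cats and dogs", "(\\w+) and (\\w+)", "\\2 and \\1")  # "dogs and cats"
```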
str_remove() and str_remove_all()

Shorthand for str_replace(x, pattern, "") and str_replace_all(x, pattern, ""):
[1] "The cat sat on the mat"
[1] "Call us on --"
[1] "linguistics"
[1] "Grammar" "system" "People" "sometimes" "describe" "grammar"
[7] "rules" "fact" "language" "word"
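A sketch of removal calls consistent with the first three results above:

```r
library(stringr)

str_remove("The cat sat on the mat.", "\\.")       # drop the final period
str_remove_all("Call us on 07-3365-1234", "\\d+")  # "Call us on --"
str_remove("sociolinguistics", "^socio")           # "linguistics"
```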
str_split()

Split strings on a pattern, returning a list:
[1] "the" "cat" "sat" "on" "the" "mat"
[1] "one" "two" "three" "four"
[1] "Grammar is the system of a language."
[2] "People sometimes describe grammar as the rules of a language, but in fact no language has rules."
[3] "If we use the word rules, we suggest that somebody created the rules first and then spoke the language, like the rules of a game."
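Splitting calls consistent with the first two results above (a sketch; unlist() flattens the one-element list):

```r
library(stringr)

unlist(str_split("the cat sat on the mat", " "))     # six tokens
unlist(str_split("one, two, three, four", ",\\s*"))  # "one" "two" "three" "four"
```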
str_locate()

Returns the start and end positions of matches — useful when you need to know where in the string a pattern occurs:
start end
[1,] 64 70
start end
[1,] 64 70
[2,] 442 448
[3,] 538 544
[4,] 728 734
[5,] 786 792
[6,] 847 853
[7,] 1237 1243
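A self-contained sketch of the locating idiom (shorter input than the text used above, so the positions differ):

```r
library(stringr)

s <- "People sometimes describe grammar as the rules of a language."

str_locate(s, "grammar")      # matrix with columns start and end (first match)
str_locate_all(s, "grammar")  # list of matrices, one per input string
```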
stringr Functions
Q1. What is the difference between str_extract() and str_extract_all()?
Q2. You want to capitalise all words longer than 5 characters in a text. Which stringr function would you use?
What you will learn: How to apply regular expressions to realistic corpus linguistics and text processing tasks.
Tasks covered: Corpus searching, text cleaning, metadata extraction, frequency analysis, and dplyr integration.
A common corpus task is retrieving all contexts in which a pattern appears. We simulate a small multi-document corpus:
corpus <- data.frame(
doc_id = paste0("doc", 1:10),
register = rep(c("Academic", "News"), each = 5),
text = c(
"Grammar is the systematic study of the structure of a language.",
"Morphology examines how words are formed from smaller units called morphemes.",
"Syntax deals with the arrangement of words to form grammatical sentences.",
"Phonology studies the sound systems and phonological rules of languages.",
"Pragmatics investigates how context and intention affect meaning in communication.",
"Scientists announced a major breakthrough in natural language processing yesterday.",
"The new grammar checker software was released to the public on Monday morning.",
"Researchers found that bilingual speakers process syntax differently than monolinguals.",
"Language acquisition in children follows predictable phonological and syntactic stages.",
"The government launched a literacy program to improve grammar skills in schools."
),
stringsAsFactors = FALSE
)

  doc_id register
1 doc2 Academic
2 doc4 Academic
text
1 Morphology examines how words are formed from smaller units called morphemes.
2 Phonology studies the sound systems and phonological rules of languages.
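The chunk producing the filtered rows above was collapsed in this export; a plausible self-contained reconstruction (the "-ology" pattern and the two-document toy corpus are assumptions):

```r
library(stringr)
library(dplyr)

toy <- data.frame(
  doc_id = c("doc1", "doc2"),
  register = c("Academic", "Academic"),
  text = c("Grammar is the systematic study of the structure of a language.",
           "Morphology examines how words are formed from smaller units.")
)

# Keep only documents whose text contains a word ending in "-ology"
toy |>
  dplyr::filter(str_detect(text, "\\w+ology\\b")) |>
  dplyr::select(doc_id, register, text)
```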
doc_id ology_words
1 doc2 Morphology
2 doc4 Phonology
doc_id register n_grammar
1 doc1 Academic 1
2 doc7 News 1
3 doc10 News 1
4 doc2 Academic 0
5 doc3 Academic 0
6 doc4 Academic 0
7 doc5 Academic 0
8 doc6 News 0
9 doc8 News 0
10 doc9 News 0
# Count how often each linguistic subfield is mentioned
subfields <- c("syntax", "morphology", "phonology", "pragmatics", "grammar")
subfield_counts <- sapply(subfields, function(sf)
sum(str_count(corpus$text, regex(sf, ignore_case = TRUE))))
data.frame(subfield = subfields, count = subfield_counts) |>
dplyr::arrange(dplyr::desc(count)) |>
flextable() |>
flextable::set_table_properties(width = .4, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 12) |>
flextable::fontsize(size = 12, part = "header") |>
flextable::align_text_col(align = "center") |>
flextable::set_caption(caption = "Frequency of linguistic subfield terms in the corpus.") |>
flextable::border_outer()

subfield | count |
|---|---|
grammar | 3 |
syntax | 2 |
morphology | 1 |
phonology | 1 |
pragmatics | 1 |
Regular expressions are the primary tool for cleaning raw corpus text:
raw_texts <- c(
" Grammar is the system of a language. ",
"Words like 'cat', 'bat', and 'hat' rhyme!",
"Phone: +61-7-3365-1234 | Email: info@uq.edu.au",
"Chapter 4: Syntax (pp. 112--145) — see also §3.2",
"The year\t2024\twas notable for advances in NLP."
)
raw_texts |>
# Normalise whitespace (collapse multiple spaces/tabs to one)
str_replace_all("\\s+", " ") |>
# Remove leading and trailing whitespace
str_trim() |>
# Remove content in parentheses
str_remove_all("\\(.*?\\)") |>
# Remove section references (§3.2 etc.)
str_remove_all("§\\d+\\.\\d+") |>
# Remove em dashes and following spaces
str_remove_all("—\\s*") |>
# Trim again after removals
str_trim()

[1] "Grammar is the system of a language."
[2] "Words like 'cat', 'bat', and 'hat' rhyme!"
[3] "Phone: +61-7-3365-1234 | Email: info@uq.edu.au"
[4] "Chapter 4: Syntax see also"
[5] "The year 2024 was notable for advances in NLP."
A powerful application of regex is extracting structured information from free text:
# Simulate file names with embedded metadata
file_names <- c(
"speaker01_female_academic_2019.txt",
"speaker14_male_news_2021.txt",
"speaker07_female_fiction_2020.txt",
"speaker23_male_academic_2022.txt"
)
# Extract each metadata component
data.frame(
filename = file_names,
speaker_id = str_extract(file_names, "speaker\\d+"),
gender = str_extract(file_names, "(?<=_)(female|male)(?=_)"),
register = str_extract(file_names, "(?<=_(female|male)_)[a-z]+"),
year = str_extract(file_names, "\\d{4}")
)

                            filename speaker_id gender register year
1 speaker01_female_academic_2019.txt  speaker01 female academic 2019
2       speaker14_male_news_2021.txt  speaker14   male     news 2021
3  speaker07_female_fiction_2020.txt  speaker07 female  fiction 2020
4   speaker23_male_academic_2022.txt  speaker23   male academic 2022
By default, regex in stringr is case-sensitive. Use regex(..., ignore_case = TRUE) to match regardless of case:
[1] TRUE TRUE TRUE TRUE
[1] "Grammar" "grammar" "Grammars" "grammar" "Grammar" "grammar"
[7] "grammar" "grammar" "grammar" "grammars" "grammar"
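A self-contained sketch of the case-insensitivity contrast (the test vector is an assumption):

```r
library(stringr)

words <- c("Grammar", "GRAMMAR", "grammar", "Grammars")

str_detect(words, "grammar")                             # only exact-case matches
str_detect(words, regex("grammar", ignore_case = TRUE))  # TRUE for all four
```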
dplyr pipelines

Regular expressions integrate seamlessly with dplyr for filtering and creating new columns:
corpus |>
dplyr::filter(str_detect(text, regex("syntax|morphology", ignore_case = TRUE))) |>
dplyr::mutate(
primary_topic = str_extract(text,
regex("syntax|morphology|phonology|pragmatics|grammar",
ignore_case = TRUE)),
n_words = str_count(text, "\\S+"),
has_definition = str_detect(text, "\\bis\\b|\\bdeals with\\b|\\bexamines\\b")
) |>
dplyr::select(doc_id, register, primary_topic, n_words, has_definition) doc_id register primary_topic n_words has_definition
1 doc2 Academic Morphology 11 TRUE
2 doc3 Academic Syntax 11 TRUE
3 doc8 News syntax 10 FALSE
Q1. What regular expression would you use to extract all words that contain at least one digit (e.g., “A4”, “mp3”, “COVID-19”)?
Q2. You want to extract the domain name from email addresses (the part after @ and before the final .). Which regex extracts uq from user@uq.edu.au?
Q3. What does str_replace_all(text, \"(\\\\w+) and (\\\\w+)\", \"\\\\2 and \\\\1\") do?
Ten practical exercises covering the most common corpus-search regex tasks.
Each question asks you to identify the correct regular expression for a realistic search task on a tokenised text vector. All answers use stringr::str_detect() applied to a character vector called text.
Q1. Which regex extracts all forms of walk from a tokenised text (walk, walks, walked, walking, walker)?
Q2. Which regex extracts all words beginning with “un” (e.g., ungrammatical, unusual, undo)?
Q3. Which regex finds all numeric tokens (whole numbers like 2024, 42, 100)?
Q4. Which regex extracts all words ending in -ing (e.g., running, working, thinking)?
Q5. Which regex matches email addresses (e.g., cat@uq.edu.au, info@ladal.edu.au)?
Q6. Which regex identifies tokens that contain at least one digit mixed with letters (e.g., mp3, A4, COVID-19, type2)?
Q7. Which regex extracts hyphenated compound words (e.g., well-being, self-aware, long-term)?
Q8. Which regex finds capitalised tokens — words beginning with an uppercase letter followed by lowercase letters (e.g., proper nouns like London, Paris, Grammar)?
Q9. Which regex finds tokens that are questions ending with a question mark (e.g., you?, this?)?
Q10. Which regex finds tokens containing double vowels (e.g., agreement, book, see, moon)?
A compact reference for the most commonly used regex elements in R.
Pattern | Meaning |
|---|---|
. | Any character except newline |
^ | Start of string / line |
$ | End of string / line |
\\b | Word boundary |
\\B | Non-word boundary |
[abc] | One of: a, b, or c |
[^abc] | Not a, b, or c |
[a-z] | Lowercase letter |
[[:alpha:]] | Any letter |
[[:digit:]] | Any digit |
[[:punct:]] | Any punctuation |
* | 0 or more (greedy) |
+ | 1 or more (greedy) |
? | 0 or 1 — optional (greedy) |
{n} | Exactly n times |
{n,} | n or more times (greedy) |
{n,m} | Between n and m times (greedy) |
*? | 0 or more (lazy) |
+? | 1 or more (lazy) |
{n,m}? | Between n and m times (lazy) |
(abc) | Capturing group |
(?:abc) | Non-capturing group |
a|b | a or b |
\\w | Word character [a-zA-Z0-9_] |
\\d | Digit [0-9] |
\\s | Whitespace |
\\W | Non-word character |
\\D | Non-digit |
\\S | Non-whitespace |
(?=...) | Positive lookahead |
(?!...) | Negative lookahead |
(?<=...) | Positive lookbehind |
(?<!...) | Negative lookbehind |
stringr function summary

Function | Returns |
|---|---|
str_detect(x, p) | logical vector — does p match? |
str_count(x, p) | integer vector — how many matches? |
str_extract(x, p) | character vector — first match (NA if none) |
str_extract_all(x, p) | list of character vectors — all matches |
str_replace(x, p, r) | character vector — first match replaced |
str_replace_all(x, p, r) | character vector — all matches replaced |
str_remove(x, p) | character vector — first match removed |
str_remove_all(x, p) | character vector — all matches removed |
str_split(x, p) | list of character vectors — parts between matches |
str_locate(x, p) | integer matrix — start and end of first match |
str_locate_all(x, p) | list of integer matrices — all match positions |
str_starts(x, p) | logical — does x start with p? |
str_ends(x, p) | logical — does x end with p? |
Martin Schweinberger. 2026. Regular Expressions in R. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/regex/regex.html (Version 2026.03.28), doi: 10.5281/zenodo.19332943.
@manual{martinschweinberger2026regular,
author = {Martin Schweinberger},
title = {Regular Expressions in R},
year = {2026},
note = {https://ladal.edu.au/tutorials/regex/regex.html},
organization = {The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia},
edition = {2026.03.28},
doi = {10.5281/zenodo.19332943}
}
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Australia/Brisbane
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] checkdown_0.0.13 flextable_0.9.11 lubridate_1.9.4 forcats_1.0.0
[5] stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1 readr_2.1.5
[9] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] generics_0.1.4 fontLiberation_0.1.0 renv_1.1.7
[4] xml2_1.3.6 stringi_1.8.7 hms_1.1.4
[7] digest_0.6.39 magrittr_2.0.4 evaluate_1.0.5
[10] grid_4.4.2 timechange_0.3.0 RColorBrewer_1.1-3
[13] fastmap_1.2.0 jsonlite_2.0.0 zip_2.3.2
[16] BiocManager_1.30.27 scales_1.4.0 fontBitstreamVera_0.1.1
[19] codetools_0.2-20 textshaping_1.0.0 cli_3.6.5
[22] rlang_1.1.7 fontquiver_0.2.1 litedown_0.9
[25] commonmark_2.0.0 withr_3.0.2 yaml_2.3.10
[28] gdtools_0.5.0 tools_4.4.2 officer_0.7.3
[31] uuid_1.2-1 tzdb_0.5.0 vctrs_0.7.2
[34] R6_2.6.1 lifecycle_1.0.5 htmlwidgets_1.6.4
[37] ragg_1.5.1 pkgconfig_2.0.3 pillar_1.11.1
[40] gtable_0.3.6 glue_1.8.0 data.table_1.17.0
[43] Rcpp_1.1.1 systemfonts_1.3.1 xfun_0.56
[46] tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.51
[49] farver_2.1.2 patchwork_1.3.0 htmltools_0.5.9
[52] rmarkdown_2.30 compiler_4.4.2 S7_0.2.1
[55] markdown_2.0 askpass_1.2.1 openssl_2.3.2
This tutorial was re-developed with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to help revise the tutorial text, structure the instructional content, generate the R code examples, and write the checkdown quiz questions and feedback strings. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.
---
title: "Regular Expressions in R"
author: "Martin Schweinberger"
date: "2026"
params:
title: "Regular Expressions in R"
author: "Martin Schweinberger"
year: "2026"
version: "2026.03.28"
url: "https://ladal.edu.au/tutorials/regex/regex.html"
institution: "The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia"
description: "This tutorial introduces regular expressions (regex) in R, covering pattern syntax, character classes, quantifiers, anchors, and practical applications for finding and replacing patterns in text data. It is aimed at researchers in linguistics and digital humanities who need to perform sophisticated text search and processing tasks."
doi: "10.5281/zenodo.19332943"
format:
html:
toc: true
toc-depth: 4
code-fold: show
code-tools: true
theme: cosmo
---
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
library(checkdown)
library(stringr)
library(dplyr)
library(flextable)
options(stringsAsFactors = FALSE)
options(scipen = 100)
options(max.print = 100)
```
# Introduction {#intro}
This tutorial introduces **regular expressions** (regex) and demonstrates how to use them when working with language data in R. A regular expression is a special sequence of characters that describes a search pattern. You can think of regular expressions as precision search tools — far more powerful than simple find-and-replace — that let you locate, extract, validate, and transform text based on its structure rather than its exact content.
Regular expressions have wide applications across linguistics and computational humanities: searching corpora for inflected forms, extracting named entities, cleaning OCR output, tokenising text, validating annotation schemes, and building text-processing pipelines. Once mastered, they become one of the most versatile tools in any language researcher's toolkit.
::: {.callout-note}
## Learning Objectives
By the end of this tutorial you will be able to:
1. Explain what a regular expression is and how it differs from a simple string search
2. Construct patterns using literal characters, the wildcard `.`, anchors, character classes, and POSIX classes
3. Apply quantifiers — including greedy and lazy variants — to specify repetition
4. Use capturing groups, non-capturing groups, and alternation
5. Use shorthand escape sequences (`\w`, `\d`, `\s`) and understand the double-backslash requirement in R
6. Write lookahead and lookbehind assertions for context-sensitive matching
7. Apply the key `stringr` functions — `str_detect()`, `str_extract()`, `str_replace()`, and others — with regular expressions
8. Use regular expressions for practical corpus tasks: concordance searches, text cleaning, metadata extraction, and frequency analysis
9. Integrate regex with `dplyr` pipelines for filtering and annotation
:::
::: {.callout-note}
## Prerequisite Tutorials
Before working through this tutorial, please complete or familiarise yourself with:
- [Getting Started with R and RStudio](/tutorials/intror/intror.html)
- [Loading, Saving, and Generating Data in R](/tutorials/load/load.html)
- [String Processing in R](/tutorials/string/string.html)
:::
::: {.callout-note}
## Citation
```{r citation-callout-top, echo=FALSE, results='asis'}
cat(
params$author, ". ",
params$year, ". *",
params$title, "*. ",
params$institution, ". ",
"url: ", params$url, " ",
"(Version ", params$version, ").",
sep = ""
)
```
:::
::: {.callout-note}
## External Resources
For further study, the following resources are highly recommended:
- @friedl2006mastering — the definitive reference on regular expressions
- Chapter 17 of @peng2016r — a practical introduction in an R context
- [RStudio Regex Cheatsheet](https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf) — a quick-reference card by Ian Kopacka
- Nick Thieberger's [Introduction to Regular Expressions](https://www.youtube.com/watch?v=8ILToE0CNpM) — a YouTube tutorial aimed at humanities scholars
:::
---
## Preparation and Session Set-up {-}
Install required packages (once only):
```{r install, eval=FALSE}
install.packages("stringr")
install.packages("dplyr")
install.packages("flextable")
install.packages("checkdown")
```
Load packages:
```{r load-packages, message=FALSE, warning=FALSE}
library(stringr) # string manipulation and regex functions
library(dplyr) # data frame manipulation
library(flextable) # formatted tables
library(checkdown) # interactive exercises
options(stringsAsFactors = FALSE)
options(scipen = 100)
```
We will work with two types of objects throughout: a short **example sentence** for demonstrating individual patterns, and a longer **example text** representing realistic corpus data.
```{r sample-data}
# Short example sentence for basic demonstrations
sent <- "The cat sat on the mat."
# A longer example text: an excerpt about linguistics
et <- paste(
"Grammar is the system of a language. People sometimes describe grammar as",
"the rules of a language, but in fact no language has rules. If we use the",
"word rules, we suggest that somebody created the rules first and then spoke",
"the language, like the rules of a game. But languages did not start like",
"that. Languages started when humans started to communicate with each other.",
"Grammars developed naturally. After some time, people described the grammar",
"of their languages. Languages change over time. Grammar changes too.",
"Children learn the grammar of their first language naturally. They do not",
"need to study it. Native speakers know intuitively whether a sentence is",
"grammatically correct or not. Non-native speakers often learn grammar rules",
"formally, through instruction. Prescriptive grammar describes how people",
"should speak, while descriptive grammar describes how people actually speak.",
"Linguists study grammars to understand language structure and acquisition.",
"The field of syntax deals with sentence structure, while morphology examines",
"how words are formed. Phonology studies sound systems in human languages.",
"Pragmatics investigates how context influences the interpretation of meaning.",
"Computational linguistics applies formal grammar to natural language processing.",
"Regular expressions are useful tools for searching and extracting patterns.",
"They can match words like 'cat', 'bat', or 'hat' with a single pattern."
)
# Split into whitespace-delimited tokens (punctuation stays attached to words)
tokens <- str_split(et, "\\s+") |> unlist()
```
---
# Regular Expression Patterns {#patterns}
::: {.callout-note}
## Section Overview
**What you will learn:** The building blocks of regular expressions — how each type of pattern works and what it matches.
**Key concept:** Regular expressions describe structure, not content. `[aeiou]{2,}` matches any sequence of two or more vowels, regardless of which vowels or in which word.
:::
## Basic characters {-}
The simplest regular expression is a **literal character** — it matches exactly that character. A sequence of literal characters matches that exact sequence:
```{r basic-chars}
# Literal match: does "cat" appear in the sentence?
str_detect(sent, "cat")
# The dot . matches ANY single character except newline
str_detect(sent, "c.t") # matches "cat"
str_detect(sent, "m.t") # matches "mat"
str_detect(sent, ".at") # matches "cat", "sat", "mat"
```
To match a literal dot (rather than "any character"), **escape it** with a double backslash:
```{r literal-dot}
# Match a literal period at the end of the sentence
str_detect(sent, "\\.") # TRUE — the sentence ends with a full stop
# Without escaping, . matches any character:
str_detect("abc", ".") # TRUE — any character matches
str_detect("abc", "\\.") # FALSE — no literal dot in "abc"
```
::: {.callout-tip}
## The double backslash in R
In most programming languages, a single backslash `\` is the regex escape character. In R strings, `\` itself must be escaped, so regex escapes require **double** backslash `\\`. For example:
- `\\.` in R code → `\.` as a regex → matches a literal dot
- `\\b` in R code → `\b` as a regex → matches a word boundary
- `\\d` in R code → `\d` as a regex → matches a digit
This double-backslash requirement catches many beginners. Remember: every `\` you intend for regex needs to be written as `\\` in R.
:::
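To see what actually reaches the regex engine, print the pattern with `writeLines()`, which displays the string without R's source-code escaping:

```{r backslash-demo}
# writeLines() shows the string as the regex engine sees it
writeLines("\\.")   # one backslash, one dot — the regex \.
writeLines("\\d")   # the regex shorthand \d
nchar("\\.")        # 2 characters: the backslash and the dot
```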
## Anchors {-}
Anchors match **positions** in the string, not characters. They constrain where in the string a pattern can match.
```{r anchors}
# ^ matches the START of the string
str_detect(sent, "^The") # TRUE — "The" is at the start
str_detect(sent, "^cat") # FALSE — "cat" is not at the start
# $ matches the END of the string
str_detect(sent, "mat\\.$") # TRUE — "mat." is at the end
str_detect(sent, "cat$") # FALSE — "cat" is not at the end
# \b matches a WORD BOUNDARY (between a word char and a non-word char)
str_detect("catalogue", "\\bcat\\b") # FALSE — "cat" is part of a word
str_detect("the cat sat", "\\bcat\\b") # TRUE — "cat" is a whole word
# \B matches where \b does NOT (i.e., inside a word)
str_detect("catalogue", "\\Bcat\\B") # FALSE — "cat" is at word START
str_detect("concatenate", "\\Bcat\\B") # TRUE — "cat" is in the middle
```
::: {.callout-tip}
## Word boundaries in corpus searches
`\b` is indispensable for corpus searches. Without it, searching for "the" would match "the" inside "other", "there", "father", and so on. Always use `\\bword\\b` when you want whole-word matches.
:::
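A quick check of the difference:

```{r boundary-demo}
str_extract_all("the other one is there", "the")       # 3 matches — inside "other" and "there" too
str_extract_all("the other one is there", "\\bthe\\b") # 1 match — the whole word only
```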
## Character classes {-}
A **character class** `[...]` matches any single character from the set listed inside the brackets:
```{r char-classes}
# Match 'c', 's', or 'm' followed by 'at'
str_extract_all(sent, "[csm]at")
# Negated class [^...]: match any character NOT in the set
str_extract_all(sent, "[^aeiou ]") # non-vowel, non-space characters
# Ranges
str_extract_all("Hello World 123", "[a-z]") # lowercase letters
str_extract_all("Hello World 123", "[A-Z]") # uppercase letters
str_extract_all("Hello World 123", "[0-9]") # digits
str_extract_all("Hello World 123", "[a-zA-Z]") # all letters
```
### POSIX character classes {-}
R supports **POSIX character classes** — named sets written inside `[:..:]` inside an outer `[...]`:
```{r posix}
str_extract_all("Hello, World! 123.", "[[:alpha:]]") # letters only
str_extract_all("Hello, World! 123.", "[[:digit:]]") # digits only
str_extract_all("Hello, World! 123.", "[[:punct:]]") # punctuation only
str_extract_all("Hello, World! 123.", "[[:alnum:]]") # letters and digits
str_extract_all("Hello\tWorld 123", "[[:blank:]]") # spaces and tabs
```
The most commonly used POSIX classes in R (two further classes, `[:cntrl:]` and `[:xdigit:]`, are rarely relevant for language data):
```{r posix-table, echo=FALSE}
data.frame(
Class = c("[:alpha:]", "[:lower:]", "[:upper:]", "[:digit:]",
"[:alnum:]", "[:punct:]", "[:space:]", "[:blank:]",
"[:graph:]", "[:print:]"),
Matches = c("Any letter (a-z, A-Z)",
"Lowercase letters (a-z)",
"Uppercase letters (A-Z)",
"Digits (0-9)",
"Letters and digits",
"Punctuation: . , ; : ! ? \" ' ( ) [ ] { } / \\ @ # $ % ^ & * - _ + = ~ ` |",
"All whitespace: space, tab, newline, return, form-feed",
"Space and tab only",
"All visible characters (alnum + punct)",
"Printable characters (graph + space)")
) |>
flextable() |>
flextable::set_table_properties(width = .9, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "POSIX character classes available in R.") |>
flextable::border_outer()
```
## Quantifiers {-}
**Quantifiers** specify how many times the preceding element should match. The table below gives an overview of the greedy (default) and lazy variants:
```{r quantifier-table, echo=FALSE}
data.frame(
Quantifier = c("*", "+", "?", "{n}", "{n,}", "{n,m}",
"*?", "+?", "??", "{n,m}?"),
Meaning = c("0 or more (greedy)",
"1 or more (greedy)",
"0 or 1 — makes element optional (greedy)",
"Exactly n times",
"n or more times (greedy)",
"Between n and m times (greedy)",
"0 or more (lazy — as few as possible)",
"1 or more (lazy — as few as possible)",
"0 or 1 (lazy)",
"Between n and m times (lazy)"),
`R example` = c('"b*"',
'"b+"',
'"colou?r"',
'"[a-z]{5}"',
'"[a-z]{3,}"',
'"[a-z]{3,6}"',
'"<.*?>"',
'"<.+?>"',
'"colou??r"',
'"[a-z]{3,6}?"'),
`Example matches` = c(
'"" "b" "bbb"',
'"b" "bbb"',
'"color" "colour"',
'"hello" "world"',
'"cat" "grammar"',
'"cat" "gram" "syntax"',
'first tag only in "<b>bold</b>"',
'first tag only in "<b>bold</b>"',
'"color" "colour"',
'shortest run of 3-6 letters'),
check.names = FALSE
) |>
flextable() |>
flextable::set_table_properties(width = .99, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "Quantifiers in regular expressions: greedy (default) and lazy (? suffix) variants.") |>
flextable::border_outer()
```
```{r quantifiers}
# * : 0 or more
str_extract_all("aabbbcccc", "b*") # matches "", "bbb", ""...
# + : 1 or more
str_extract_all("aabbbcccc", "b+") # matches "bbb"
# ? : 0 or 1 (makes the element optional)
str_detect(c("color", "colour"), "colou?r") # both TRUE
# {n} : exactly n
str_extract_all(et, "[a-z]{10}") # 10-letter lowercase runs (no anchors, so they may sit inside longer words)
# {n,} : n or more
str_extract_all(tokens, "^[[:alpha:]]{8,}$") # words of 8+ letters
# {n,m} : between n and m
str_extract_all(tokens, "^[[:alpha:]]{4,6}$") # words of 4-6 letters
```
### Greedy vs. lazy matching {-}
By default, quantifiers are **greedy** — they match as much as possible. Adding `?` after a quantifier makes it **lazy** — it matches as little as possible:
```{r greedy-lazy}
html <- "<b>bold</b> and <i>italic</i>"
# Greedy: matches from first < to LAST >
str_extract(html, "<.+>")
# Lazy: matches from first < to next >
str_extract(html, "<.+?>")
# Extract each tag individually (lazy)
str_extract_all(html, "<.+?>")
```
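A common alternative to lazy matching is a negated character class, which forbids the closing delimiter outright and often performs better than backtracking:

```{r negated-class-alt}
# <[^>]+> : "<", then one or more characters that are NOT ">", then ">"
str_extract_all(html, "<[^>]+>")
```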
## Groups and alternation {-}
**Parentheses** `()` create a **capturing group** — a sub-pattern whose match can be referenced or extracted separately. The **alternation operator** `|` means OR within a group or pattern.
```{r groups}
# Alternation: match "cat" OR "dog"
str_detect(c("I have a cat", "I have a dog", "I have a fish"),
"cat|dog")
# Alternation inside a group: match "colour" OR "color"
str_extract_all(c("British colour", "American color"), "colo(u|)r") # the empty alternative makes "u" optional
# Match all forms of "walk" with one pattern: walk, walks, walked, walking, walker
str_extract(c("walk", "walks", "walked", "walking", "walker"),
            "walk(s|ed|ing|er)?")
# Groups allow repetition of a sub-pattern
str_detect("abababab", "(ab)+") # matches one or more "ab"
```
### Non-capturing groups {-}
Use `(?:...)` when you need to group for alternation or quantification but do not need to capture the match:
```{r non-capture}
# Group for alternation without capturing
str_extract_all(et, "(?:gram|morpho|phono)logy")
```
### Backreferences in replacements {-}
Captured groups can be referred back to in replacement strings using `\\1`, `\\2`, etc.:
```{r backrefs}
# Swap the two words on either side of "and"
str_replace_all("cats and dogs", "(\\w+) and (\\w+)", "\\2 and \\1")
# Wrap all long words in asterisks (capture the word, reuse it as \\1)
str_replace_all(et, "(\\b[[:alpha:]]{8,}\\b)", "**\\1**") |>
  substr(1, 120)
```
## Special escape sequences {-}
R supports **shorthand escape sequences** for common character classes:
```{r escape-table, echo=FALSE}
data.frame(
Sequence = c("\\\\w", "\\\\W", "\\\\d", "\\\\D", "\\\\s", "\\\\S",
"\\\\b", "\\\\B"),
Matches = c("Word characters: [[:alnum:]_]",
"Non-word characters: [^[:alnum:]_]",
"Digits: [[:digit:]]",
"Non-digits: [^[:digit:]]",
"Whitespace: [[:space:]]",
"Non-whitespace: [^[:space:]]",
"Word boundary (position)",
"Non-word boundary (position)"),
`Example (R string)` = c('"\\\\w+"', '"\\\\W+"', '"\\\\d+"', '"\\\\D+"',
'"\\\\s+"', '"\\\\S+"', '"\\\\bcat\\\\b"', '"\\\\Bcat\\\\B"'),
check.names = FALSE
) |>
flextable() |>
flextable::set_table_properties(width = .95, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "Special escape sequences for common character classes (written as R strings).") |>
flextable::border_outer()
```
```{r escapes-demo}
# \w: word characters
str_extract_all("price: $4.99!", "\\w+")
# \d: digits
str_extract_all("Call 07 3365 1234 or 07 3346 5678", "\\d+")
# \s: whitespace (useful for splitting on any whitespace)
str_split("word1 word2\tword3\nword4", "\\s+")[[1]]
# \b: whole-word match
str_extract_all("grammar, grammarian, ungrammatical", "\\bgrammar\\b")
```
## Lookahead and lookbehind {-}
**Lookaround** assertions match a position based on what comes before or after it, without including that context in the match. They are essential for extracting values that are preceded or followed by specific markers.
```{r lookaround-table, echo=FALSE}
data.frame(
Syntax = c("(?=...)", "(?!...)", "(?<=...)", "(?<!...)"),
Name = c("Positive lookahead", "Negative lookahead",
"Positive lookbehind", "Negative lookbehind"),
Matches = c("Position followed by ...",
"Position NOT followed by ...",
"Position preceded by ...",
"Position NOT preceded by ...")
) |>
flextable() |>
flextable::set_table_properties(width = .85, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "Lookahead and lookbehind assertions.") |>
flextable::border_outer()
```
```{r lookaround}
prices <- c("$12.99", "$4.50", "USD 7.00", "8.95 EUR")
# Positive lookahead: match digits followed by a dot
str_extract_all(prices, "\\d+(?=\\.)")
# Positive lookbehind: match digits preceded by "$"
str_extract_all(prices, "(?<=\\$)\\d+\\.\\d+")
# Negative lookbehind: match numbers NOT preceded by "$"
str_extract_all(prices, "(?<!\\$)\\b\\d+\\.\\d+")
```
A linguistic example — extract words that come before a comma:
```{r lookahead-ling}
sample_text <- "Grammar, syntax, and morphology are core subfields of linguistics."
str_extract_all(sample_text, "\\w+(?=,)")
```
---
::: {.callout-tip}
## Exercises: Regex Patterns
:::
**Q1. What does the regex `^[A-Z]` match?**
```{r}
#| echo: false
#| label: "PAT_Q1"
check_question("A string that begins with an uppercase letter",
options = c(
"A string that begins with an uppercase letter",
"Any uppercase letter anywhere in the string",
"A string that consists entirely of uppercase letters",
"The end of a string followed by an uppercase letter"
),
type = "radio",
q_id = "PAT_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! The ^ anchor matches the position at the start of the string. [A-Z] matches one uppercase letter. Together, ^[A-Z] means 'a string (or line) whose first character is an uppercase letter'. Without the ^, [A-Z] would match any uppercase letter anywhere in the string.",
wrong = "Break it into two components: what does ^ anchor, and what does [A-Z] match?")
```
**Q2. What is the difference between `colou?r` and `colo[u]?r`?**
```{r}
#| echo: false
#| label: "PAT_Q2"
check_question("They are equivalent — both match 'colour' and 'color'; the ? makes the preceding element optional in both cases",
options = c(
"They are equivalent — both match 'colour' and 'color'; the ? makes the preceding element optional in both cases",
"colou?r matches only 'colour'; colo[u]?r matches both",
"colo[u]?r is invalid syntax — ? cannot follow a character class",
"colou?r matches 'color' only; colo[u]?r matches 'colour' only"
),
type = "radio",
q_id = "PAT_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! In colou?r, the ? applies to the literal character u — making it optional. In colo[u]?r, the ? applies to the character class [u], which also contains only u — so it is equally optional. Both patterns match 'color' (no u) and 'colour' (one u). The [u]? form is slightly more explicit and more easily extended: [ou]? would allow either 'o' or 'u' to appear optionally.",
wrong = "The ? quantifier makes the immediately preceding element optional. What is that element in each case?")
```
**Q3. You want to match words of exactly 5 characters that consist only of lowercase letters. Which pattern is correct?**
```{r}
#| echo: false
#| label: "PAT_Q3"
check_question("^[a-z]{5}$",
options = c(
"^[a-z]{5}$",
"[a-z]{5}",
"^[a-z]+{5}$",
"[a-z]{5,5}"
),
type = "radio",
q_id = "PAT_Q3",
random_answer_order = FALSE,
button_label = "Check answer",
right = "Correct! ^[a-z]{5}$ anchors the match at both ends of the string and requires exactly 5 lowercase letters. Without the anchors, [a-z]{5} would match any 5-letter sequence inside a longer word — for example, it would match 'gramm' inside 'grammar'. The {5,5} option works but is redundant (equivalent to {5}). The + before {5} is a syntax error.",
wrong = "The quantifier {5} means 'exactly 5'. But which patterns ensure the ENTIRE string is exactly 5 characters, rather than finding 5-character sequences within longer strings?")
```
---
# Key `stringr` Functions {#stringr}
::: {.callout-note}
## Section Overview
**What you will learn:** The `stringr` functions used most frequently with regular expressions, and when to use each.
**Key functions:** `str_detect()`, `str_count()`, `str_extract()`, `str_extract_all()`, `str_replace()`, `str_replace_all()`, `str_remove()`, `str_remove_all()`, `str_split()`, `str_locate()`
:::
The `stringr` package provides a consistent, user-friendly interface to regular expressions in R. All `stringr` functions follow the same pattern: the string comes first, the pattern second.
## `str_detect()` {-}
Returns `TRUE`/`FALSE` for each string in a vector. Most commonly used for filtering:
```{r str-detect}
words_sample <- c("grammar", "syntax", "morphology", "phonology",
"pragmatics", "grammarian", "ungrammatical")
# Which words contain "gram"?
str_detect(words_sample, "gram")
# Which words start with a vowel?
str_detect(words_sample, "^[aeiou]")
# Negate with !
words_sample[!str_detect(words_sample, "gram")]
```
## `str_count()` {-}
Counts non-overlapping occurrences of a pattern within each string:
```{r str-count}
# How many vowels in each word?
str_count(words_sample, "[aeiou]")
# How many times does the word "a" appear in the example text?
str_count(et, "\\ba\\b")
```
## `str_extract()` and `str_extract_all()` {-}
`str_extract()` returns the **first** match in each string. `str_extract_all()` returns **all** matches as a list:
```{r str-extract}
# Extract the first sequence of 3+ consonants
str_extract(words_sample, "[^aeiou]{3,}")
# Extract all sequences of digits from a mixed string
mixed <- c("price: 12.99 dollars", "code: A4-B12", "year: 2024")
str_extract_all(mixed, "\\d+")
# Extract all words longer than 8 characters from the example text
long_words <- str_extract_all(et, "\\b[[:alpha:]]{9,}\\b")[[1]]
sort(unique(long_words))
```
## `str_replace()` and `str_replace_all()` {-}
Replace the first (or all) occurrence(s) of a pattern with a replacement string. **Backreferences** (`\\1`, `\\2`) refer to captured groups in the replacement:
```{r str-replace}
# Replace first match
str_replace(sent, "[csm]at", "dog")
# Replace all matches
str_replace_all(sent, "[csm]at", "dog")
# Backreference: reverse the order of two words separated by "and"
str_replace_all("cats and dogs", "(\\w+) and (\\w+)", "\\2 and \\1")
# Add emphasis around all long words (capture group + backreference)
str_replace_all(et, "(\\b[[:alpha:]]{8,}\\b)", "**\\1**") |>
  substr(1, 120)
```
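From stringr 1.5.0 onwards, the replacement can also be a function: it is called once per match and its return value is substituted in. This is the cleanest way to transform, rather than merely replace, matched text:

```{r replace-function}
# Uppercase every three-letter lowercase word (toupper receives each match)
str_replace_all(sent, "\\b[a-z]{3}\\b", toupper)
```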
## `str_remove()` and `str_remove_all()` {-}
Shorthand for `str_replace(x, pattern, "")` and `str_replace_all(x, pattern, "")`:
```{r str-remove}
# Remove all punctuation from the sentence
str_remove_all(sent, "[[:punct:]]")
# Remove all digits
str_remove_all("Call us on 07-3365-1234", "\\d")
# Remove leading and trailing whitespace
str_remove_all(" linguistics ", "^\\s+|\\s+$")
# Keep only tokens of 4+ letters
long_tokens <- tokens[str_detect(tokens, "^[[:alpha:]]{4,}$")]
head(long_tokens, 10)
```
## `str_split()` {-}
Split strings on a pattern, returning a list:
```{r str-split}
# Split on whitespace
str_split("the cat sat on the mat", "\\s+")[[1]]
# Split on punctuation or whitespace
str_split("one,two; three four", "[[:punct:]\\s]+")[[1]]
# Split a text into sentences (approximate)
sentences <- str_split(et, "(?<=[.!?])\\s+")[[1]]
head(sentences, 3)
```
## `str_locate()` {-}
Returns the start and end positions of matches — useful when you need to know where in the string a pattern occurs:
```{r str-locate}
# Find where "grammar" first occurs in the example text
str_locate(et, "grammar")
# Find all occurrences
str_locate_all(et, "\\bgrammar\\b")[[1]]
```
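One practical use of match positions is a rough keyword-in-context (KWIC) display, combining `str_locate_all()` with `str_sub()`:

```{r kwic-demo}
# Show 15 characters of context on either side of each match
pos <- str_locate_all(et, "\\bgrammar\\b")[[1]]
str_sub(et, pmax(pos[, "start"] - 15, 1), pmin(pos[, "end"] + 15, nchar(et)))
```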
---
::: {.callout-tip}
## Exercises: `stringr` Functions
:::
**Q1. What is the difference between `str_extract()` and `str_extract_all()`?**
```{r}
#| echo: false
#| label: "STR_Q1"
check_question("str_extract() returns only the first match in each string; str_extract_all() returns all matches as a list",
options = c(
"str_extract() returns only the first match in each string; str_extract_all() returns all matches as a list",
"str_extract() returns a character vector; str_extract_all() returns a logical vector",
"str_extract_all() is faster than str_extract() for long strings",
"They are identical — the _all suffix makes no difference"
),
type = "radio",
q_id = "STR_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! str_extract(x, pattern) returns a character vector of the same length as x, where each element is the first match in the corresponding string (or NA if no match). str_extract_all(x, pattern) returns a list of the same length as x, where each element is a character vector of ALL matches in the corresponding string. Use str_extract() when you expect one match per string; use str_extract_all() when you want every occurrence.",
wrong = "Think about how many matches each function returns per input string, and what data structure holds the result.")
```
**Q2. You want to capitalise all words longer than 5 characters in a text. Which `stringr` function would you use?**
```{r}
#| echo: false
#| label: "STR_Q2"
check_question("str_replace_all() — replace each matching word with an uppercase version using a replacement function",
options = c(
"str_replace_all() — replace each matching word with an uppercase version using a replacement function",
"str_extract_all() — extract the words and then capitalise",
"str_detect() — detect long words and then capitalise separately",
"str_remove_all() — remove the original words then insert capitals"
),
type = "radio",
q_id = "STR_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! str_replace_all() can accept a function as the replacement argument. The function receives each match and returns the replacement: str_replace_all(text, '\\\\b[a-z]{6,}\\\\b', toupper). str_extract_all() only extracts, it does not modify. str_detect() only returns TRUE/FALSE. str_remove_all() deletes matches rather than transforming them.",
wrong = "Which function both finds a pattern AND replaces it with something? And which one can apply a transformation function to each match?")
```
---
# Practical Applications {#practice}
::: {.callout-note}
## Section Overview
**What you will learn:** How to apply regular expressions to realistic corpus linguistics and text processing tasks.
**Tasks covered:** Corpus searching, text cleaning, metadata extraction, frequency analysis, and `dplyr` integration.
:::
## Searching a corpus: concordance-style extraction {-}
A common corpus task is retrieving all contexts in which a pattern appears. We simulate a small multi-document corpus:
```{r corpus-setup}
corpus <- data.frame(
doc_id = paste0("doc", 1:10),
register = rep(c("Academic", "News"), each = 5),
text = c(
"Grammar is the systematic study of the structure of a language.",
"Morphology examines how words are formed from smaller units called morphemes.",
"Syntax deals with the arrangement of words to form grammatical sentences.",
"Phonology studies the sound systems and phonological rules of languages.",
"Pragmatics investigates how context and intention affect meaning in communication.",
"Scientists announced a major breakthrough in natural language processing yesterday.",
"The new grammar checker software was released to the public on Monday morning.",
"Researchers found that bilingual speakers process syntax differently than monolinguals.",
"Language acquisition in children follows predictable phonological and syntactic stages.",
"The government launched a literacy program to improve grammar skills in schools."
),
stringsAsFactors = FALSE
)
```
```{r corpus-search}
# Find all documents containing words ending in "-ology"
corpus |>
dplyr::filter(str_detect(text, "\\b\\w+ology\\b")) |>
dplyr::select(doc_id, register, text)
```
```{r corpus-extract}
# Extract all "-ology" words from each document
corpus |>
dplyr::mutate(
ology_words = sapply(text, function(t)
paste(str_extract_all(t, "\\b\\w+ology\\b")[[1]], collapse = ", "))
) |>
dplyr::filter(ology_words != "") |>
dplyr::select(doc_id, ology_words)
```
## Counting pattern frequencies {-}
```{r count-patterns}
# Count occurrences of "grammar" (case-insensitive) per document
corpus |>
dplyr::mutate(
n_grammar = str_count(text, regex("grammar", ignore_case = TRUE))
) |>
dplyr::select(doc_id, register, n_grammar) |>
dplyr::arrange(dplyr::desc(n_grammar))
```
```{r count-subfields}
# Count how often each linguistic subfield is mentioned
subfields <- c("syntax", "morphology", "phonology", "pragmatics", "grammar")
subfield_counts <- sapply(subfields, function(sf)
sum(str_count(corpus$text, regex(sf, ignore_case = TRUE))))
data.frame(subfield = subfields, count = subfield_counts) |>
dplyr::arrange(dplyr::desc(count)) |>
flextable() |>
flextable::set_table_properties(width = .4, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 12) |>
flextable::fontsize(size = 12, part = "header") |>
flextable::align_text_col(align = "center") |>
flextable::set_caption(caption = "Frequency of linguistic subfield terms in the corpus.") |>
flextable::border_outer()
```
## Text cleaning {-}
Regular expressions are the primary tool for cleaning raw corpus text:
```{r text-clean}
raw_texts <- c(
" Grammar is the system of a language. ",
"Words like 'cat', 'bat', and 'hat' rhyme!",
"Phone: +61-7-3365-1234 | Email: info@uq.edu.au",
"Chapter 4: Syntax (pp. 112--145) — see also §3.2",
"The year\t2024\twas notable for advances in NLP."
)
raw_texts |>
# Normalise whitespace (collapse multiple spaces/tabs to one)
str_replace_all("\\s+", " ") |>
# Remove leading and trailing whitespace
str_trim() |>
# Remove content in parentheses
str_remove_all("\\(.*?\\)") |>
# Remove section references (§3.2 etc.)
str_remove_all("§\\d+\\.\\d+") |>
# Remove em dashes and following spaces
str_remove_all("—\\s*") |>
# Trim again after removals
str_trim()
```
## Extracting structured information {-}
A powerful application of regex is extracting structured information from free text:
```{r extract-structured}
# Simulate file names with embedded metadata
file_names <- c(
"speaker01_female_academic_2019.txt",
"speaker14_male_news_2021.txt",
"speaker07_female_fiction_2020.txt",
"speaker23_male_academic_2022.txt"
)
# Extract each metadata component
data.frame(
filename = file_names,
speaker_id = str_extract(file_names, "speaker\\d+"),
gender = str_extract(file_names, "(?<=_)(female|male)(?=_)"),
  register = str_extract(file_names, "(?<=_(female|male)_)[a-z]+"), # \\w+ would run into "_2019"
year = str_extract(file_names, "\\d{4}")
)
```
## Case-insensitive matching {-}
By default, regex in `stringr` is case-sensitive. Use `regex(..., ignore_case = TRUE)` to match regardless of case:
```{r case-insensitive}
# Match "Grammar", "GRAMMAR", "grammar" etc.
str_detect(c("Grammar", "GRAMMAR", "grammar", "GrAmMaR"),
regex("grammar", ignore_case = TRUE))
# Extract all mentions of a term regardless of capitalisation
str_extract_all(et, regex("\\bgrammar\\w*\\b", ignore_case = TRUE))[[1]]
```
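Conversely, when the search term itself contains regex metacharacters (such as `$` or `.`), wrap it in `fixed()` to match it literally instead of escaping every character by hand:

```{r fixed-demo}
str_detect("The price is $4.99 (incl. GST)", "$4.99")        # FALSE — $ and . are metacharacters
str_detect("The price is $4.99 (incl. GST)", fixed("$4.99")) # TRUE — literal match
```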
## Regex in `dplyr` pipelines {-}
Regular expressions integrate seamlessly with `dplyr` for filtering and creating new columns:
```{r dplyr-regex}
corpus |>
dplyr::filter(str_detect(text, regex("syntax|morphology", ignore_case = TRUE))) |>
dplyr::mutate(
primary_topic = str_extract(text,
regex("syntax|morphology|phonology|pragmatics|grammar",
ignore_case = TRUE)),
n_words = str_count(text, "\\S+"),
has_definition = str_detect(text, "\\bis\\b|\\bdeals with\\b|\\bexamines\\b")
) |>
dplyr::select(doc_id, register, primary_topic, n_words, has_definition)
```
---
::: {.callout-tip}
## Exercises: Practical Applications
:::
**Q1. What regular expression would you use to extract all words that contain at least one digit (e.g., "A4", "mp3", "COVID-19")?**
```{r}
#| echo: false
#| label: "PRAC_Q1"
check_question("\\\\w*\\\\d\\\\w* — a word character sequence containing at least one digit",
options = c(
"\\\\w*\\\\d\\\\w* — a word character sequence containing at least one digit",
"\\\\d+ — one or more digits",
"[0-9] — a single digit",
"\\\\b\\\\d\\\\b — a word boundary around a single digit"
),
type = "radio",
q_id = "PRAC_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! \\w*\\d\\w* matches: zero or more word characters (\\w*), then at least one digit (\\d), then zero or more word characters (\\w*). This captures tokens like 'A4' and 'mp3'; in 'COVID-19' it matches only '19', because \\w does not include the hyphen and 'COVID' itself contains no digit. The plain \\d+ would extract only the digit sequence, not the surrounding letters.",
wrong = "You need a pattern that matches the whole token (letters + digits), not just the digit itself. What pattern matches a sequence of word characters that includes at least one digit?")
```
**Q2. You want to extract the domain from email addresses (the part after `@` and before the final dot-extension). Which regex extracts `uq.edu` from `user@uq.edu.au`?**
```{r}
#| echo: false
#| label: "PRAC_Q2"
check_question("(?<=@)[\\w.]+(?=\\.\\w+$) — a lookbehind for @ and lookahead for the final .extension",
options = c(
"(?<=@)[\\w.]+(?=\\.\\w+$) — a lookbehind for @ and lookahead for the final .extension",
"@\\w+ — the @ sign followed by word characters",
"\\w+\\.[\\w.]+ — word characters and dots after the @",
"\\w+(?=\\.edu|\\.com|\\.org) — lookahead for known domain suffixes"
),
type = "radio",
q_id = "PRAC_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! (?<=@) is a positive lookbehind requiring @ immediately before the match. [\\w.]+ matches word characters and dots (the full domain like 'uq.edu'). (?=\\.\\w+$) is a positive lookahead requiring a dot followed by word characters at the end of the string ('.au'). The lookbehind and lookahead ensure we are positioned correctly without including @ or the final extension in the match.",
wrong = "Think about lookahead and lookbehind: how can you match the content between @ and the last .extension without including the @ or the extension in the match?")
```
**Q3. What does `str_replace_all(text, \"(\\\\w+) and (\\\\w+)\", \"\\\\2 and \\\\1\")` do?**
```{r}
#| echo: false
#| label: "PRAC_Q3"
check_question("It swaps the two words on either side of 'and', using backreferences \\\\1 and \\\\2 to refer to the captured groups",
options = c(
"It swaps the two words on either side of 'and', using backreferences \\\\1 and \\\\2 to refer to the captured groups",
"It removes the word 'and' and joins the surrounding words",
"It replaces every occurrence of 'and' with \\\\2 and \\\\1 literally",
"It matches only if there are exactly two words separated by 'and'"
),
type = "radio",
q_id = "PRAC_Q3",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! The pattern (\\w+) and (\\w+) captures two words in groups 1 and 2. The replacement \\2 and \\1 puts them back in reverse order. So 'cats and dogs' becomes 'dogs and cats'. Backreferences in str_replace_all() replacements use \\1, \\2, etc. to refer to what was captured in the corresponding group.",
wrong = "The parentheses () create capturing groups numbered left to right. In the replacement string, \\1 inserts what group 1 captured, and \\2 inserts what group 2 captured. What does reversing their order do?")
```
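The backreference swap is easy to verify directly. A minimal sketch, assuming `stringr` is loaded; the input string is purely illustrative:

```{r prac-q3-demo}
# Groups 1 and 2 capture the words around "and"; the replacement
# string reinserts them in reverse order via \\2 and \\1
str_replace_all("cats and dogs, tea and coffee",
                "(\\w+) and (\\w+)", "\\2 and \\1")
# returns "dogs and cats, coffee and tea"
```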
---
# Corpus Search Exercises {#exercises}
::: {.callout-note}
## Section Overview
**Ten practical exercises covering the most common corpus-search regex tasks.**
Each question asks you to identify the correct regular expression for a realistic search task on a tokenised text vector. All answers use `stringr::str_detect()` applied to a character vector called `text`.
:::
```{r exercise-setup, echo=FALSE}
text <- c(
"Walking", "walked", "walks", "walk", "walker",
"ungrammatical", "unusual", "undo", "Unknown",
"The", "year", "2024", "saw", "COVID-19",
"well-being", "self-aware", "long-term",
"London", "Paris", "Grammar", "Syntax",
"agreement", "book", "feet", "see", "moon",
"running", "working", "thinking", "helping",
"cat@uq.edu.au", "info@ladal.edu.au",
"Where", "are", "you?", "What", "is", "this?",
"mp3", "A4", "type2"
)
```
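Every exercise below can be checked against this vector with `str_detect()`. As a warm-up sketch (assuming `stringr` is loaded), a plain literal search illustrates the workflow — and why literal matching is often too crude:

```{r exercise-warmup}
# Subset the token vector: keep tokens where the pattern matches
text[str_detect(text, "walk")]
# returns walked, walks, walk, walker — "Walking" is missed
# because matching is case-sensitive
```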
**Q1. Which regex extracts all forms of *walk* from a tokenised text (walk, walks, walked, walking, walker)?**
```{r}
#| echo: false
#| label: "EX_Q1"
check_question('\\\\bwalk\\\\w*\\\\b — word boundary + "walk" + zero or more word characters',
options = c(
'\\\\bwalk\\\\w*\\\\b — word boundary + "walk" + zero or more word characters',
'"walk" — a plain literal match',
'"walk.*" — "walk" followed by any characters',
'"[Ww]alk" — capital or lowercase W followed by alk'
),
type = "radio",
q_id = "EX_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! \\bwalk\\w*\\b matches the literal string "walk" at a word boundary, followed by zero or more word characters (\\w*), up to the next word boundary. This captures walk, walks, walked, walking, walker, and any other form beginning with "walk".',
wrong = 'You need a pattern that starts with "walk" and captures any suffix (s, ed, ing, er, ...). Which pattern anchors to word boundaries and allows zero or more following word characters?')
```
**Q2. Which regex extracts all words beginning with "un" (e.g., *ungrammatical*, *unusual*, *undo*)?**
```{r}
#| echo: false
#| label: "EX_Q2"
check_question('\\\\b[Uu]n\\\\w+ — word boundary + "un" (case-flexible) + one or more word characters',
options = c(
'\\\\b[Uu]n\\\\w+ — word boundary + "un" (case-flexible) + one or more word characters',
'"un" — matches any token containing "un" anywhere',
'"^un" — anchors to string start (works on single tokens)',
'"\\\\bun" — word boundary before "un" with no suffix requirement'
),
type = "radio",
q_id = "EX_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! \\b[Uu]n\\w+ matches: a word boundary (\\b), then "un" or "Un", then one or more word characters (\\w+). The \\w+ ensures we match a word that continues beyond "un" — avoiding matching the standalone token "un". The [Uu] handles both "ungrammatical" and "Unknown".',
wrong = 'What pattern starts at a word boundary, matches the prefix "un", and then requires at least one more character to follow?')
```
**Q3. Which regex finds all numeric tokens (whole numbers like *2024*, *42*, *100*)?**
```{r}
#| echo: false
#| label: "EX_Q3"
check_question('\\\\b\\\\d+\\\\b — word boundaries around one or more digits',
options = c(
'\\\\b\\\\d+\\\\b — word boundaries around one or more digits',
'"[0-9]" — matches a single digit anywhere in the token',
'"\\\\d" — matches a single digit',
'"\\\\d*" — zero or more digits (matches every string)'
),
type = "radio",
q_id = "EX_Q3",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! \\b\\d+\\b matches one or more digits (\\d+) as a complete token (word boundaries on both sides). This matches "2024", "42", "100" but not "COVID-19" or "mp3". Use \\d+ without boundaries if you want digit sequences embedded in mixed tokens.',
wrong = 'You need to match a complete numeric token — all digits, nothing else. Which pattern uses \\d and word boundaries correctly?')
```
**Q4. Which regex extracts all words ending in *-ing* (e.g., *running*, *working*, *thinking*)?**
```{r}
#| echo: false
#| label: "EX_Q4"
check_question('\\\\b\\\\w+ing\\\\b — one or more word characters followed by "ing" at a word boundary',
options = c(
'\\\\b\\\\w+ing\\\\b — one or more word characters followed by "ing" at a word boundary',
'"ing$" — "ing" at the end of the string (works on single tokens)',
'".*ing" — any characters followed by "ing"',
'"[ing]" — any of the characters i, n, or g'
),
type = "radio",
q_id = "EX_Q4",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! \\b\\w+ing\\b matches: a word boundary, one or more word characters (\\w+ ensures there is a stem before the suffix), then "ing", then a word boundary. This captures running, working, thinking, helping. On a tokenised vector, "ing$" (anchoring to string end) works equally well.',
wrong = 'You need a pattern that ends with "ing" and has a word boundary after it. What ensures there is also a stem before the suffix?')
```
**Q5. Which regex matches email addresses (e.g., *cat\@uq.edu.au*, *info\@ladal.edu.au*)?**
```{r}
#| echo: false
#| label: "EX_Q5"
check_question('"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\\\.[A-Za-z]{2,}" — local part + @ + domain + TLD',
options = c(
'"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\\\.[A-Za-z]{2,}" — local part + @ + domain + TLD',
'"\\\\w+@\\\\w+" — word characters on both sides of @',
'"@" — any token containing the @ symbol',
'"\\\\S+@\\\\S+" — non-whitespace characters around @'
),
type = "radio",
q_id = "EX_Q5",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! The full email pattern [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,} captures: the local part (letters, digits, dots, underscores, %, +, -), the @ symbol, the domain, a literal dot, and a top-level domain of at least 2 letters. Simpler patterns like \\w+@\\w+ miss dots and hyphens in domains.',
wrong = 'An email address has three structural components: a local part, the @ symbol, and a domain with a top-level domain. Which pattern captures all three correctly?')
```
**Q6. Which regex identifies tokens that contain at least one digit mixed with letters (e.g., *mp3*, *A4*, *COVID-19*, *type2*)?**
```{r}
#| echo: false
#| label: "EX_Q6"
check_question('"\\\\w*\\\\d\\\\w*" — word characters, then a digit, then word characters',
options = c(
'"\\\\w*\\\\d\\\\w*" — word characters, then a digit, then word characters',
'"\\\\d+" — one or more digits (matches pure numbers too)',
'"[a-z]\\\\d" — a letter followed immediately by a digit',
'"\\\\b\\\\d\\\\b" — a digit as a complete token'
),
type = "radio",
q_id = "EX_Q6",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! \\w*\\d\\w* matches zero or more word characters, then at least one digit, then zero or more word characters. This captures mp3, A4, and type2. Note it also matches pure numbers like "2024" — if you want to exclude those, use [a-zA-Z]\\w*\\d|\\d\\w*[a-zA-Z] to require at least one letter AND one digit.',
wrong = 'You need a token that contains both letters and at least one digit. Which pattern allows word characters before AND after a digit, with zero or more letters on either side?')
```
**Q7. Which regex extracts hyphenated compound words (e.g., *well-being*, *self-aware*, *long-term*)?**
```{r}
#| echo: false
#| label: "EX_Q7"
check_question('"\\\\b\\\\w+-\\\\w+\\\\b" — word characters, hyphen, word characters within word boundaries',
options = c(
'"\\\\b\\\\w+-\\\\w+\\\\b" — word characters, hyphen, word characters within word boundaries',
'"-" — any token containing a hyphen',
'"\\\\w-\\\\w" — one word character, hyphen, one word character',
'"[a-z]-[a-z]" — a lowercase letter, hyphen, lowercase letter'
),
type = "radio",
q_id = "EX_Q7",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! \\b\\w+-\\w+\\b matches: a word boundary, one or more word characters (the first element), a literal hyphen, one or more word characters (the second element), a word boundary. This captures well-being, self-aware, long-term. For multi-part compounds like "mother-in-law", extend to \\w+(-\\w+)+.',
wrong = 'A hyphenated word has one or more word characters before the hyphen and one or more after. Which pattern captures full word elements (not just single characters) on both sides?')
```
**Q8. Which regex finds capitalised tokens — words beginning with an uppercase letter followed by lowercase letters (e.g., proper nouns like *London*, *Paris*, *Grammar*)?**
```{r}
#| echo: false
#| label: "EX_Q8"
check_question('"\\\\b[A-Z][a-z]+\\\\b" — uppercase first letter followed by one or more lowercase letters',
options = c(
'"\\\\b[A-Z][a-z]+\\\\b" — uppercase first letter followed by one or more lowercase letters',
'"[A-Z]" — any uppercase letter anywhere in the token',
'"[A-Z]+" — one or more uppercase letters (matches ALL-CAPS too)',
'"^[A-Z]" — string starting with uppercase (same as \\\\b[A-Z] on tokens)'
),
type = "radio",
q_id = "EX_Q8",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! \\b[A-Z][a-z]+\\b matches: a word boundary, one uppercase letter, one or more lowercase letters, a word boundary. This matches London, Paris, Grammar, Syntax — title-case words. It excludes ALL-CAPS tokens like "NLP" or "COVID". Note: at sentence boundaries, even non-proper nouns are capitalised; for proper-noun detection you would need additional context filtering.',
wrong = 'You want Title Case: one uppercase letter at the start, then lowercase letters. Which pattern enforces both the case of the first character and the case of those that follow?')
```
**Q9. Which regex finds tokens that are questions ending with a question mark (e.g., *you?*, *this?*)?**
```{r}
#| echo: false
#| label: "EX_Q9"
check_question('"\\\\?" — matches any token containing a literal question mark (on a tokenised vector)',
options = c(
'"\\\\?" — matches any token containing a literal question mark (on a tokenised vector)',
'"?" — zero or one of the preceding element (quantifier, not literal)',
'".*\\\\?$" — any characters followed by ? at string end',
'"\\\\?$" — a question mark at the end of the string'
),
type = "radio",
q_id = "EX_Q9",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! On a tokenised vector, \\? detects any token containing a literal question mark. Since ? is a quantifier in regex, it must be escaped as \\? to match the literal character. \\?$ is equally correct — it anchors the ? to the end of the string. .*\\?$ works too and is the most explicit form. The unescaped "?" alone is a regex quantifier meaning "zero or one of the preceding element".',
wrong = 'The question mark has special meaning in regex (it is a quantifier). How do you match a LITERAL question mark character?')
```
**Q10. Which regex finds tokens containing double vowels (e.g., *agreement*, *book*, *see*, *moon*)?**
```{r}
#| echo: false
#| label: "EX_Q10"
check_question('"[aeiouAEIOU]{2}" — two consecutive vowels (case-insensitive vowel class, quantified)',
options = c(
'"[aeiouAEIOU]{2}" — two consecutive vowels (case-insensitive vowel class, quantified)',
'"[aeiou][aeiou]" — same as above but written out explicitly',
'"[aeiou]+" — one or more vowels (also matches single vowels)',
'"(aa|ee|ii|oo|uu)" — only identical vowel pairs'
),
type = "radio",
q_id = "EX_Q10",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! [aeiouAEIOU]{2} matches any two consecutive vowels (same or different: "oo", "ea", "ai", "ou"). Both [aeiouAEIOU]{2} and [aeiou][aeiou] are correct — the {2} form is more compact and easier to extend to {3} or {2,}. [aeiou]+ also matches double vowels but would return TRUE for any vowel at all. (aa|ee|ii|oo|uu) only matches identical doubled vowels, missing "ea", "ou", "ai".',
wrong = 'You want exactly two consecutive vowel characters. A vowel character class covers [aeiou]. What quantifier specifies exactly two occurrences?')
```
---
# Quick Reference {#reference}
::: {.callout-note}
## Section Overview
**A compact reference for the most commonly used regex elements in R.**
:::
## Pattern summary table {-}
```{r ref-table, echo=FALSE}
data.frame(
Pattern = c(
".", "^", "$", "\\\\b", "\\\\B",
"[abc]", "[^abc]", "[a-z]", "[[:alpha:]]", "[[:digit:]]", "[[:punct:]]",
"*", "+", "?", "{n}", "{n,}", "{n,m}",
"*?", "+?", "{n,m}?",
"(abc)", "(?:abc)", "a|b",
"\\\\w", "\\\\d", "\\\\s",
"\\\\W", "\\\\D", "\\\\S",
"(?=...)", "(?!...)", "(?<=...)", "(?<!...)"
),
Meaning = c(
"Any character except newline",
"Start of string / line",
"End of string / line",
"Word boundary",
"Non-word boundary",
"One of: a, b, or c",
"Not a, b, or c",
"Lowercase letter",
"Any letter",
"Any digit",
"Any punctuation",
"0 or more (greedy)",
"1 or more (greedy)",
"0 or 1 — optional (greedy)",
"Exactly n times",
"n or more times (greedy)",
"Between n and m times (greedy)",
"0 or more (lazy)",
"1 or more (lazy)",
"Between n and m times (lazy)",
"Capturing group",
"Non-capturing group",
"a or b",
"Word character [a-zA-Z0-9_]",
"Digit [0-9]",
"Whitespace",
"Non-word character",
"Non-digit",
"Non-whitespace",
"Positive lookahead",
"Negative lookahead",
"Positive lookbehind",
"Negative lookbehind"
)
) |>
flextable() |>
flextable::set_table_properties(width = .75, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "Quick reference: regular expression patterns in R.") |>
flextable::border_outer()
```
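The greedy/lazy distinction in the table is easiest to see side by side. A minimal sketch, assuming `stringr` is loaded; the tagged string is purely illustrative:

```{r greedy-lazy-demo}
x <- "<b>bold</b> and <i>italic</i>"
str_extract(x, "<.+>")   # greedy: runs to the LAST ">", so the whole string
str_extract(x, "<.+?>")  # lazy: stops at the FIRST ">", returning "<b>"
```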
## `stringr` function summary {-}
```{r fn-table, echo=FALSE}
data.frame(
Function = c(
"str_detect(x, p)",
"str_count(x, p)",
"str_extract(x, p)",
"str_extract_all(x, p)",
"str_replace(x, p, r)",
"str_replace_all(x, p, r)",
"str_remove(x, p)",
"str_remove_all(x, p)",
"str_split(x, p)",
"str_locate(x, p)",
"str_locate_all(x, p)",
"str_starts(x, p)",
"str_ends(x, p)"
),
Returns = c(
"logical vector — does p match?",
"integer vector — how many matches?",
"character vector — first match (NA if none)",
"list of character vectors — all matches",
"character vector — first match replaced",
"character vector — all matches replaced",
"character vector — first match removed",
"character vector — all matches removed",
"list of character vectors — parts between matches",
"integer matrix — start and end of first match",
"list of integer matrices — all match positions",
"logical — does x start with p?",
"logical — does x end with p?"
)
) |>
flextable() |>
flextable::set_table_properties(width = .99, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "Key stringr functions for use with regular expressions.") |>
flextable::border_outer()
```
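The return-type differences in the table matter in practice: the `_all` variants return lists because the number of matches varies per element. A short sketch with illustrative data, assuming `stringr` is loaded:

```{r fn-returns-demo}
s <- c("mp3 and A4", "no digits here")
str_extract(s, "\\d+")      # character vector: "3", then NA
str_extract_all(s, "\\d+")  # list: c("3", "4"), then character(0)
```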
---
# Citation & Session Info {.unnumbered}
::: {.callout-note}
## Citation
```{r citation-callout, echo=FALSE, results='asis'}
cat(
params$author, ". ",
params$year, ". *",
params$title, "*. ",
params$institution, ". ",
"url: ", params$url, " ",
"(Version ", params$version, "), ",
"doi: ", params$doi, ".",
sep = ""
)
```
```{r citation-bibtex, echo=FALSE, results='asis'}
key <- paste0(
tolower(gsub(" ", "", gsub(",.*", "", params$author))),
params$year,
tolower(gsub("[^a-zA-Z]", "", strsplit(params$title, " ")[[1]][1]))
)
cat("```\n")
cat("@manual{", key, ",\n", sep = "")
cat(" author = {", params$author, "},\n", sep = "")
cat(" title = {", params$title, "},\n", sep = "")
cat(" year = {", params$year, "},\n", sep = "")
cat(" note = {", params$url, "},\n", sep = "")
cat(" organization = {", params$institution, "},\n", sep = "")
cat(" edition = {", params$version, "}\n", sep = "")
cat(" doi = {", params$doi, "}\n", sep = "")
cat("}\n```\n")
```
:::
```{r fin}
sessionInfo()
```
::: {.callout-note}
## AI Transparency Statement
This tutorial was re-developed with the assistance of **Claude** (claude.ai), a large language model created by Anthropic. Claude was used to help revise the tutorial text, structure the instructional content, generate the R code examples, and write the `checkdown` quiz questions and feedback strings. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.
:::
[Back to top](#intro)
[Back to HOME](/index.html)
# References {.unnumbered}